85 research outputs found

    Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks

    Over the last hundred years, not much has changed in how organic chemistry is conducted. In most laboratories, the current state is still trial-and-error experiments guided by human expertise acquired over decades. What if, given all the knowledge published, we could develop an artificial intelligence-based assistant to accelerate the discovery of novel molecules? Although many approaches were recently developed to generate novel molecules in silico, only a few studies complete the full design-make-test cycle, including the synthesis and the experimental assessment. One reason is that the synthesis part can be tedious and time-consuming, and requires years of experience to perform successfully. Hence, the synthesis is one of the critical limiting factors in molecular discovery. In this thesis, I take advantage of similarities between human language and organic chemistry to apply linguistic methods to chemical reactions, and develop artificial intelligence-based tools for accelerating chemical synthesis. First, I investigate reaction prediction models focusing on small data sets of challenging stereo- and regioselective carbohydrate reactions. Second, I develop a multi-step synthesis planning tool predicting reactants and suitable reagents (e.g. catalysts and solvents). Both forward prediction and retrosynthesis approaches use black-box models. Hence, I then study methods to provide more information about the models’ predictions. I develop a reaction classification model that labels chemical reactions and facilitates the communication of reaction concepts. As a side product of the classification models, I obtain reaction fingerprints that enable efficient similarity searches in chemical reaction space. Moreover, I study approaches for predicting reaction yields. Lastly, after I approached all chemical reaction tasks with atom-mapping independent models, I demonstrate the generation of accurate atom-mapping from the patterns my models have learned while being trained self-supervised on chemical reactions. My PhD thesis’s leitmotif is the application of the attention-based Transformer architecture to molecules and reactions represented with a text notation. It is as if atoms are my letters, molecules my words, and reactions my sentences. With this analogy, I teach my neural network models the language of chemical reactions - atom by atom. While exploring the link between organic chemistry and language, I make an essential step towards the automation of chemical synthesis, which could significantly reduce the costs and time required to discover and create new molecules and materials.

    Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design

    Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high-reward molecules, and does so faster, given a fixed oracle budget. Beam Enumeration is the first method to jointly address explainability and sample efficiency for molecular design.

    Transformers and Large Language Models for Chemistry and Drug Discovery

    Language modeling has seen impressive progress over recent years, mainly prompted by the invention of the Transformer architecture, sparking a revolution in many fields of machine learning, with breakthroughs in chemistry and biology. In this chapter, we explore how analogies between chemical and natural language have inspired the use of Transformers to tackle important bottlenecks in the drug discovery process, such as retrosynthetic planning and chemical space exploration. The revolution started with models able to perform particular tasks with a single type of data, like linearised molecular graphs, which then evolved to include other types of data, like spectra from analytical instruments, synthesis actions, and human language. A new trend leverages recent developments in large language models, giving rise to a wave of models capable of solving generic tasks in chemistry, all facilitated by the flexibility of natural language. As we continue to explore and harness these capabilities, we can look forward to a future where machine learning plays an even more integral role in accelerating scientific discovery.

    Bayesian Optimization for Chemical Reactions

    Reaction optimization is challenging and traditionally delegated to domain experts who iteratively propose increasingly optimal experiments. Problematically, the reaction landscape is complex and often requires hundreds of experiments to reach convergence, representing an enormous resource sink. Bayesian optimization (BO) is an optimization algorithm that recommends the next experiment based on previous observations and has recently gained considerable interest in the general chemistry community. The application of BO for chemical reactions has been demonstrated to increase efficiency in optimization campaigns and can recommend favorable reaction conditions amidst many possibilities. Moreover, its ability to jointly optimize desired objectives such as yield and stereoselectivity makes it an attractive alternative, or at least a complement, to domain expert-guided optimization. With the democratization of BO software, the barrier to entry for applying BO to chemical reactions has drastically lowered. The intersection between the paradigms will see advancements at an ever more rapid pace. In this review, we discuss how chemical reactions can be transformed into machine-readable formats which can be learned by machine learning (ML) models. We present a foundation for BO and how it has already been applied to optimize chemical reaction outcomes. The important message we convey is that realizing the full potential of ML-augmented reaction optimization will require close collaboration between experimentalists and computational scientists.

    Reaction classification and yield prediction using the differential reaction fingerprint DRFP.

    Predicting the nature and outcome of reactions using computational methods is a crucial tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint DRFP. The DRFP algorithm takes a reaction SMILES as an input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed left and right from the reaction arrow, respectively, without the need for distinguishing between reactants and reagents. We show that DRFP performs better than DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, reaching the performance of state-of-the-art learned fingerprints in both tasks while being data-independent
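
The set-difference idea in the abstract above can be sketched in a few lines. This is a toy illustration only, not the published DRFP implementation: character n-grams of the SMILES string stand in for the circular molecular n-grams (which would need a cheminformatics toolkit to compute), and the function names, the SHA-256 hashing, and the 2048-bit length are all assumptions made for the example.

```python
import hashlib


def char_ngrams(smiles, max_n=3):
    """Toy stand-in for circular substructures: all character n-grams up to max_n."""
    return {
        smiles[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(smiles) - n + 1)
    }


def side_features(side, max_n=3):
    """Union of the n-gram sets of every molecule on one side of the arrow."""
    features = set()
    for molecule in side.split("."):
        if molecule:  # skip empty fragments, e.g. a missing agents field
            features |= char_ngrams(molecule, max_n)
    return features


def toy_difference_fingerprint(reaction_smiles, n_bits=2048, max_n=3):
    """Hash the symmetric difference of left- and right-side features into bits."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    # Reactants and agents are pooled: no reactant/reagent distinction is needed.
    left = side_features(parts[0] + "." + parts[1], max_n)
    right = side_features(parts[2], max_n)
    bits = [0] * n_bits
    for feature in left ^ right:  # features present on exactly one side
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % n_bits] = 1
    return bits


fingerprint = toy_difference_fingerprint("CCO.CC(=O)O>>CC(=O)OCC")
```

Features that occur on both sides of the arrow cancel in the symmetric difference, so only what the reaction changes contributes bits; a reaction whose two sides are identical yields an all-zero vector.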

    A comparison of microtensile and microcompression methods for studying plastic properties of nanocrystalline electrodeposited nickel at different length scales

    A comparison of microcompression and microtensile methods to study mechanical properties of electrodeposited nanocrystalline (nc) nickel has been performed. Microtensile tests that probe a volume of more than 2 × 10⁶ μm³ show reasonable agreement with results from microcompression tests that probe much smaller volumes down to a few μm³. Differences between the two uniaxial techniques are discussed in terms of measurement errors, probed volume and surface effects, strain rate, and influence of stress state. Uniaxial loading in compression mode revealed several advantages for studying stress–strain properties

    Parvalbumin: calcium and magnesium buffering in the distal nephron

    Parvalbumin (PV) is a classical member of the EF-hand protein superfamily that has been described as a Ca²⁺ buffer and Ca²⁺ transporter/shuttle protein and may also play an additional role in Mg²⁺ handling. PV is exclusively expressed in the early part of the distal convoluted tubule in the human and mouse kidneys. Recent studies in Pvalb knockout mice revealed a role of PV in the distal handling of electrolytes: the lack of PV was associated with a mild salt-losing phenotype with secondary aldosteronism, salt craving and stronger bones compared with controls. A link between the Ca²⁺-buffering capacity of PV and the expression of the thiazide-sensitive Na⁺-Cl⁻ cotransporter was established, which could be relevant to the regulation of sodium transport in the distal nephron. Variants in the PVALB gene that encodes PV have been described, but their relevance to kidney function has not been established. PV is also considered a reliable marker of chromophobe carcinoma and oncocytoma, two neoplasms deriving from the distal nephron. The putative role of PV in tumourigenesis remains to be investigated.